Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters
Abstract
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
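As a rough illustration of why temporal blocking lessens pressure on the memory interface, the following minimal sketch applies several time steps to one tile of a 1D Jacobi stencil before moving on to the next tile. It is not the pipelined shared-cache scheme introduced in the paper; it uses a simpler overlapped-tile variant that recomputes a halo of `tb` cells on each side of a tile. All function names and the parameters `tile` and `tb` are hypothetical and chosen for illustration only.

```python
def jacobi_step(u):
    """One 3-point Jacobi sweep; the two end points are held fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i + 1]) / 2.0
    return v


def jacobi_naive(u, steps):
    """Reference version: every time step streams over the whole array."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def jacobi_time_blocked(u, steps, tile=8, tb=4):
    """Overlapped temporal blocking (illustrative, not the paper's method).

    Each tile is copied together with a halo of t cells per side, advanced
    t time steps locally, and only the tile interior is written back.
    Halo cells become stale by one cell per step, so after t steps the
    cells at distance >= t from an artificial edge are still exact.
    """
    n = len(u)
    done = 0
    while done < steps:
        t = min(tb, steps - done)      # time steps fused in this block
        out = u[:]
        for s in range(0, n, tile):
            e = min(s + tile, n)
            lo = max(0, s - t)         # clipped halo; index 0 is a true boundary
            hi = min(n, e + t)
            local = u[lo:hi]           # small working set, reused t times
            for _ in range(t):
                local = jacobi_step(local)
            out[s:e] = local[s - lo:e - lo]
        u = out
        done += t
    return u
```

In this sketch each tile is read once and then updated `t` times, so a real implementation whose tile fits in cache would move roughly `t` times less data to and from main memory than the naive version, at the cost of redundant halo updates. The pipelined approach described in the abstract aims to minimize exactly this kind of boundary overhead.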
Related articles
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, es...
Efficient multicore-aware parallelization strategies for iterative stencil computations
Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory...
Synchronization and Pipelining on Multicore: Shaping Parallelism for a New Generation of Processors
The potential for higher performance from increasing on-chip transistor densities, on the one hand, and the limitations in instruction-level parallelism of sequential applications and in the scalability of increasingly complicated superscalar and multithreaded architectures, on the other, are leading the microprocessor industry to embrace chip multi-processors as a cost-effective solution for t...
An Auto-tuning Jit Compiler for Accelerating Multiple Stencil Computations
We present a JIT compiler with auto-tuning capabilities fusing multiple stencil computations. Data arrays for scientific computing of image processing often exceed cache-memory size. To take advantage of spatial and temporal locality, a common method is to partition the images into tiling blocks for multicore architectures. In realistic scenarios, the multiple image algorithms, most of which ar...
A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
Journal:
- Parallel Processing Letters
Volume: 20
Publication date: 2010